Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax. I am feeling good. I went through the Datacamp tutorial, technically I knew that stuff at some point, but since I never use it, it is a good repetition. I am working somewhat with data science, but I am already so deep into specific science that I think it is useful to take a course by the way that deals with fundamentals. https://github.com/Supervitux/IODS-project

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Tue Dec  1 14:47:54 2020"

The text continues here.


2. Regression and model validation

1. Read in and describe the data

learning2014 <- read.csv("data/learning2014.csv")
str(learning2014)
## 'data.frame':    166 obs. of  8 variables:
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166   8

The data consists of 166 observations for 7 variables (the 8th variable is merely an index here). The data has been collected over the course Introduction to Social Statistics in 2014. Non-selfexplanatory columns are:

  • “deep”: Deep learning evaluation
  • “stra”: Strategic learning evaluation
  • “surf”; Surface learning evaluation

More details available at http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt

2. Graphical overview of the data

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p

summary(learning2014)
##        X          gender       Age           Attitude          deep      
##  Min.   :  1.00   F:110   Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  1st Qu.: 42.25   M: 56   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Median : 83.50           Median :22.00   Median :32.00   Median :3.667  
##  Mean   : 83.50           Mean   :25.51   Mean   :31.43   Mean   :3.680  
##  3rd Qu.:124.75           3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##  Max.   :166.00           Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Blue represents data where gender is “male”, red where gender is “female”. The table overview shows that the data consists to only one third of data with gender being “male”, The graphical overview shows that higher points correlate with a better attitude. Otherwise barely any correlation can be stated visually.

3. Regression model

my_model2 <- lm(Points ~ Attitude + stra + Age , data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + Age, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## Attitude     0.34808    0.05622   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## Age         -0.08822    0.05302  -1.664   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The summary shows the results of two statistical tests. First of all the t-value tell us about how far our estimated parameter is from a hypothesized 0. This the highest for “stra”. The p-value (Pr) tells us about the probability of observing a value at least as extreme as our coefficient and usually should be below a 0.05 threshold. The summary of the model shows that only “Attitude” has a low enough p-value to reject the null hypothesis and therefore has a significant relationship with our target “Points”. Therefore we drop the features “stra” and “Age” and rerun the model:

my_model2 <- lm(Points ~ Attitude, data = learning2014)
summary(my_model2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now our model is statistically significant.

4. Interpretation of the model

The intercept tells us that someone with 0 attitude will have 11.6 points. From there for each “Attitude” increment the “Points” will rise by 0.35. Someone with the maximum of 50 in attitude will have a predicted point score of 17 + 11 = 28. We have a linear relationship here. The multiple r-squared is always between 0 and 1 and the higher the more variation in the response variable is explained by the model. The same is true for the adjusted r-squared just that it takes into account the number of independent variables. Our model is therefore not well suited to explain this variation as the value is only 0.19.

5. Plot Residuals

plot(my_model2, which = 1)

Ideally the residual would be as close to 0 as possible. We have a quite big spreading, which is also reflected by our poor r-squared.

The normal Q-Q plot plots the quantiles of two different probability distributions against each other. Ideally all the values are on the diagonal:

plot(my_model2, which = 2)

We can see that for the outer quantiles the values diverge from the quantiles.

Leverage describes how far a covariate is from other covariates, respectively for the present assumed linear regression it measures how sensitive a predicted value is to a change of the corresponding true value:

plot(my_model2, which = 5)

The points are more or less evenly spread.


3. Logistic Regression

2. Read in and describe the data

alc <- read.csv("data/alc.csv")
colnames(alc)
##  [1] "X"          "school"     "sex"        "age"        "address"   
##  [6] "famsize"    "Pstatus"    "Medu"       "Fedu"       "Mjob"      
## [11] "Fjob"       "reason"     "nursery"    "internet"   "guardian"  
## [16] "traveltime" "studytime"  "failures"   "schoolsup"  "famsup"    
## [21] "paid"       "activities" "higher"     "romantic"   "famrel"    
## [26] "freetime"   "goout"      "Dalc"       "Walc"       "health"    
## [31] "absences"   "G1"         "G2"         "G3"         "alc_use"   
## [36] "high_use"

The data contains information about student achievements in secondary eduction of two Portugese schools. It includes students grades, demographic, social and school related features, which were collected via questionnaries. More information at: https://archive.ics.uci.edu/ml/datasets/Student+Performance

3. Hypotheses for relationships of alcohol consumption to chosen variables

Personally I would except students with high alcohol consumption (high_use) to have worse grades (G3), more absences (absences), worse health (health) and more class failures (failures).

4. Explore Hypotheses

Let’s explore these hypotheses, first numerically:

table(high_use = alc$high_use, grade = alc$G3)
##         grade
## high_use  0  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
##    FALSE  1  0  1  6  4 15  4 19  3 38 16 54 19 31 15 30  5  7
##    TRUE   1  1  0  1  2  7  2  8  4 26 10 27  7  7  1  8  2  0
table(high_use = alc$high_use, absences = alc$absences)
##         absences
## high_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 16 17 18 19 20 21 26 27
##    FALSE 52 38 42 33 24 16 16  9 14  6  5  2  4  1  1  0  0  1  0  2  1  0  0
##    TRUE  13 13 16  8 12  6  5  3  6  6  2  4  4  1  6  1  1  1  1  0  1  1  1
##         absences
## high_use 29 44 45
##    FALSE  0  0  1
##    TRUE   1  1  0
table(high_use = alc$high_use, health = alc$health)
##         health
## high_use  1  2  3  4  5
##    FALSE 35 28 62 49 94
##    TRUE  11 16 19 18 50
table(high_use = alc$high_use, failures = alc$failures)
##         failures
## high_use   0   1   2   3
##    FALSE 244  12  10   2
##    TRUE   90  12   9   3

The numerical observation confirms my hypotheses, however, a graphical exploration should make it simpler to draw an actual comparison:

library(ggplot2)
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")   + ggtitle("Student grades by alcohol consumption and sex")

g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot() + ylab("absences")   + ggtitle("Student absences by alcohol consumption and sex")

g3 <- ggplot(alc, aes(x = high_use, y = health, col = sex))
g3 + geom_boxplot() + ylab("health")   + ggtitle("Student health by alcohol consumption and sex")

g4 <- ggplot(alc, aes(x = high_use, y = failures, col = sex))
g4 + geom_boxplot() + ylab("failures")   + ggtitle("Student failures by alcohol consumption and sex")

The grades for students with high alcohol consumption seem to be generally lower. Interestingly the mean for females with high alcohol consumption is as high as the mean for females without, it seems that there are some very well performing females with high alcohol consumption dragging the mean up. We observe the same looking at absences. Male health seems to be not affected by alcohol usage, while female health generally is worse for high consumption, the mean is actually lower indicating that some females without high alcohol consumption show exceptionally bad health. Eventually alcohol consumption seems to basically noy correlate with class failure.

5. Logistic Regression

m <- glm(high_use ~ G3 + absences + health + failures, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + health + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3178  -0.7943  -0.7015   1.1814   1.9006  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.13164    0.58665  -1.929 0.053731 .  
## G3          -0.04286    0.03791  -1.131 0.258225    
## absences     0.08458    0.02283   3.705 0.000212 ***
## health       0.07387    0.08602   0.859 0.390474    
## failures     0.39341    0.19912   1.976 0.048184 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 437.96  on 377  degrees of freedom
## AIC: 447.96
## 
## Number of Fisher Scoring iterations: 4

The logistic regression model contradicts my earlier conclusions and shows that health and grades are not statistically significant while failures is. Consequently I drop health and grades from the model.

m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2680  -0.8019  -0.6967   1.2522   1.7906  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.37829    0.16479  -8.364  < 2e-16 ***
## failures     0.49972    0.18456   2.708 0.006776 ** 
## absences     0.08602    0.02287   3.762 0.000169 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 440.14  on 379  degrees of freedom
## AIC: 446.14
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)    failures    absences 
## -1.37828639  0.49971697  0.08601774

Let’s turn the coefficients into odds ratios with confidence intervals:

library(dplyr)
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp

# print out the odds ratios with their confidence intervals
cbind(OR, CI)

The odd ratios strongly confirm the relationship of failures: An individual with high alcohol consumption is 1.6 times more likely to have a failure. The impact of absences is less strongly confirmed and only slight

6. Predictive Power

Let’s make predictions with our two variables:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   10
##    TRUE     99   15

The cross-tabulation shows that our model has little false negatives, however, produces a high number of false positives. The training error is 109/382 = 0.285. This better than simple guessing, which would result in an error of 0.5.

7. Cross-validation

For crossvalidation we need to define a const function:

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

next we perform 10 fold crossvalidation:

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2853403

Crossvalidation seems to make the prediction worse in the present case.


4. Clustering and Classification

2. Load and describe the data

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The data contains information about housing values in suburbs of Boston. It contains information regarding crime rates, proportions of residental land/non-retail and various other variables influencing housing values, detailed information here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html There are 506 observables of 14 variables.

3. Graphical overview

For obtaining an graphical overview of the data it is here useful to plot a Correlation matrix additional to the summary.

library(dplyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  3.0.4     ✓ purrr   0.3.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

crim, that is crime rates, is the only variable the spans over several order of magnitudes. Otherwise most of the data is in a range of tens and hundreds. There is lots of correlation within the data. Only exception is chas, (Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)). Also black (measure for the proprtion of black population) shows rather small correlationss. We have various clear correlations that feel intuitively right, such as indus (industry proportion) with nox (nitrogen oxide concentration) or a negative correlation of lstat (lower status of populatrion in %) to medv (median value of homes).

4. Standardize Data

We will standardize the data:

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

Standardizing set all data means to zero. Now we turn crim into a categorical variable, based on its quantiles. Finally we will put 80% of the data into a train set and the rest to a testset.

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low","med_low","med_high","high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(Boston)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

The categorical crime data is split into four categories of low, medium low, medium high and high crime rates.

5. Linear Discriminant Analysis

Let*s fit a Linear Discriminant Analysis model on the data:

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

Especially the high and medium high crime rates look classified together. Nevertheless, in this particular plot low and medium low seem very mixed.

6. Predictions onto the testset

Let*s fit a Linear Discriminant Analysis model on the data:

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       16       2        2    0
##   med_low    5      16        4    0
##   med_high   0      11       13    1
##   high       0       0        0   32

We see that high and medium high was all correct predicted with one exception. However, only roughly a third of medium low was correctly predicted, where one other third was misspredicted as low and one as medium high, this is what we expected from the figure above. Also One third of “low” was misspredicted as medium low.

7. Data Distances

Let’s reload the dataset and calculate distances between the variables and then run the k-means algorithm

# load MASS and Boston
library(MASS)
data('Boston')

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(boston_scaled)

# k-means clustering
km <-kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled[6:10], col = km$cluster)

With 3 clusters we get some distinction of the datapoints, however, the graph makes clear that it could be separated more and better. Therefore, we observe the total of within cluster sum of squares (WCSS). When this measure is lowest we fidn the optimal number of clusters.

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The graph suggests that we may find a better results with 10 clusters. However, that may be a overfit, so we take 8 instead as there is a rapid decline in the curve right before 8 clusters.

# k-means clustering
km <-kmeans(boston_scaled, centers = 8)

# plot the Boston dataset with clusters
pairs(boston_scaled[6:10], col = km$cluster)

Visually the result got worse, because it is way more difficult to achieve any interpretation of the result. With 8 clusters therer is already some overfit. Nevertheless, the brightblue cluster seems to be well described.


5. Principal Component Analysis

1. Graphical overviews and summaries

library(GGally)
humans <- read.csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", row.names = 1, header= TRUE)
summary(humans)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(humans)

The summary shows us that we have a variety of ranges for our data. GNI and Mat.Mor are widely spread ranging over three orders of magnitude. The other variables are more compact. The plot shows us that we have fairly strong correlations across the board. So does Ado.Birth correlate strongly with maternal mortality, life expectancy as well as education. However, e.g. female labour shows weak to none correlation.

2. PCA on NON standardized data

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(humans)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

3. PCA on standardized data

# standardize the variables
human_std <- scale(humans)
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2")) + title("The result is very different as it is standardized. We see that female labor and female parliament representation strongly correlate, but are independent of all the other variables. All other variables either correlate or negatively correlate with each other. Therefore PCA1 basically describes these first two variables and PCA2 all the others", line = -20)

## integer(0)

“As we see, we see nothing” in the non standaridzed PCA: GNI’s variance and magnitude is so large that all the other variables become irrelevant in the PCA. The PCA is meaningless.In the standardized PCA though we get the variances clearly separated. As for some reason my caption keeps being cut: “The result is very different as it is standardized. We see that female labor and female parliament representation strongly correlate, but are independent of all the other variables. All other variables either correlate or negatively correlate with each other. Therefore PCA1 basically describes these first two variables and PCA2 all the others”

4. Interpretation on PCA

I just can repeat what was said in the previous task: We see that female labor and female parliament representation strongly correlate, but are independent of all the other variables. All other variables either correlate or negatively correlate with each other. Therefore PCA1 basically deals with gender data, whereas PCA2 describes general life variables that are gender independent.

5. Multiple Correspondence Analysis

library(FactoMineR)
library(ggplot2)
library(dplyr)
library(tidyr)
data("tea")
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The dataset includes data about the tea drinking habits of 300 people. The habits are the tea sort, how it is drunk, how it is served, where it is bought and if it is drank for lunch. Let’s do a multiple correspondence analysis:

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The MCA shows that unpackaged tea is most likely purchased in a tea shop, whereas tea in tea bgas is bought in chain stores. Also interesting is that Earl Grey tea is the most likely tea to be had with milk, whereas green tea is consumed alone and black tea rather with lemon than the others.


6. Analysis of longitudinal data

1. RATS data

a) Read in data:

library(tidyr)
library(dplyr)
library(ggplot2)
RATS <- read.csv("data/rats.csv", row.names = 1, header= TRUE)
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
summary(RATS)
##        ID      Group        WD         Weight           Time      
##  1      : 11   1:88   WD1    :16   Min.   :225.0   Min.   : 1.00  
##  2      : 11   2:44   WD15   :16   1st Qu.:267.0   1st Qu.:15.00  
##  3      : 11   3:44   WD22   :16   Median :344.5   Median :36.00  
##  4      : 11          WD29   :16   Mean   :384.5   Mean   :33.55  
##  5      : 11          WD36   :16   3rd Qu.:511.2   3rd Qu.:50.00  
##  6      : 11          WD43   :16   Max.   :628.0   Max.   :64.00  
##  (Other):110          (Other):80
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...

This is data from a nutrition study conducted on three groups of rats. The groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately) weekly, except in week seven when two recordings were taken) over a 9-week period . Let’s create a graphical representation to understand the data better:

p1 <- ggplot(RATS, aes(x = Time, y = Weight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATS$Weight),max(RATS$Weight)))
p6

A visual assessment of the graphs shows, that rats in group 1 with weight < 300 did not change their weight dramatically, there is possibly a slight decrease. However, individuals with higher weight in the beginning of the study, especially in group 2 seemed to have increased their weight over time. In order to make a general assessment of weight changes we will summarize the rats according to their group.

b) Summarize Groups

n <- RATS$Weight %>% unique() %>% length()

RATSS <- RATS %>% group_by(Group, Time) %>% summarise( mean=mean(Weight), se=sd(Weight)/sqrt(n)) %>% ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATSS)
## Rows: 33
## Columns: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se    <dbl> 1.522158, 1.309307, 1.147591, 1.360081, 1.105748, 1.178377, 1.0…
p1 <- ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + geom_point(size=1) + scale_shape_manual(values = rep(1:10, times=4))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p6

The graphical summary of the data shows that within the groups the weight varies only by little. Overtime, group 1 has almost no increase, while group 2 and 3 increase more significantly. Group 2 varies the most, in fact it is the only data for which the errorbars are easy to spot by eye. Step by step we can condense the data further into more concise graphs:

p1 <- ggplot(RATS, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + scale_x_discrete(name = "Time (days)")
p4

The rows of boxplots show that all groups contain outliers, or in this case one rat is an outlier for each groups. Nevertheless, it once more confirms earlier suspicions and so does the following even simpler graph:

# Make a summary data of the post treatment weeks (1-8)
RATS8S <- RATS %>% filter(Time > 0) %>% group_by(Group, ID) %>% summarise( mean=mean(Weight) ) %>% ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
glimpse(RATS8S)
## Rows: 16
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274…
p1 <- ggplot(RATS8S, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
## Warning: `fun.y` is deprecated. Use `fun` instead.
p5 <- p4 + scale_y_continuous(name = "mean(bprs), weeks 1-8")
p5

So lets see what happens if we remove these outliers:

RATS8S1 <- RATS8S %>% filter( (mean > 240 & Group == 1) | (mean < 480 & Group == 2) | (mean > 500 & Group == 3) )
glimpse(RATS8S1)
## Rows: 13
## Columns: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID    <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 261.0909, 260.1818, 266.5455, 269.4545, 274.7273, 274.6364, 265…
p1 <- ggplot(RATS8S1, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight)")
p5

### c) Make a fit

After removal of the outliers it becomes obvious that all the variance in Group 2 was basically caused by this one outlier. With a precise summary of our data at hand we can go ahead and see if we can fit a model onto it:

# For establishing a baseline raw data is needed
RATSRAW <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", stringsAsFactors = F, sep =  "\t")
# Once again remove outliers
RATSRAW <- RATSRAW  %>% filter( (! ID == 2) & (! ID == 12) & (! ID == 13) )
baseline <- RATSRAW$WD1
RATS8S2 <- RATS8S1 %>%
  mutate(baseline)
# Fit the ANCOVA model and see the results:
fit <- lm(mean ~ baseline + Group, data = RATS8S2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATS8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0310 -2.6287  0.1002  1.8269  7.1808 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.1903    25.7382   7.817 2.66e-05 ***
## baseline      0.2605     0.1010   2.580   0.0297 *  
## Group2      138.8380    17.0411   8.147 1.91e-05 ***
## Group3      199.6530    27.1916   7.342 4.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.669 on 9 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9985 
## F-statistic:  2692 on 3 and 9 DF,  p-value: 1.324e-13
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## baseline   1 174396  174396 7999.137 1.384e-14 ***
## Group      2   1707     853   39.147 3.628e-05 ***
## Residuals  9    196      22                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

E voilá, finally we have a model with very good significance!

2. BPRS data

a) Read in data:

BPRS <- read.csv("data/bprs.csv", row.names = 1, header= TRUE)
BPRS <- within(BPRS, {
       subject <- factor(subject)
    treatment <- factor(treatment)
})
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...

This data is a psychological study, in which 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks. After transformation in the “Data wrangling” it has the form of a data frame with 360 obs of 7 variables, which are treatment, subject, weeks, bprs and group.

b) Visualize data

Let’s make some graphs to better understand the data!

p1 <- ggplot(BPRS, aes(x = week, y = bprs, group = treatment))
p2 <- p1 + geom_text(aes(label = treatment))
p3 <- p2 + scale_x_continuous(name = "Week", breaks = seq(0, 60, 10))
p4 <- p3 + scale_y_continuous(name = "BPRS score")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

Through this interesting plot we can see that subjects in the treatment group 2 tend to at least occupy the maxima in bprs score throughout the study. However, it is really hard to say something about the middle and low end. Before we improve the visualizations let’s quickly do a linear regression to see, if that can be improved as well:

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

We immediately see that there is lots of space for improvement as “treatment” is not significant at all! Let’s try some visualizations:

p2 <- ggplot(BPRS, aes(x = week, y = bprs, linetype = treatment)) + geom_line()
p3 <- p2 + scale_x_continuous(name = "Week", breaks = seq(0, 60, 10))
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw() + theme(legend.position = "top")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

This graph is simply useless. Another try is the pairs plot

BPRS_raw <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", stringsAsFactors = F, sep = " ")

pairs <- pairs(BPRS_raw[, 3:11], cex = 0.7)

This plot show how the BPRS evolved through time and we see that it overall decrease from week 0 to week 8.

b) Mixed Effect Models

As seen earlier a simple linear model does not perform well on our data. An better option is to use a Random Intercept Model because it enables independent intercepts for each subject in our data:

library("lme4")
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000

Alternatively we can have the model also differ for each subject in slope!

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova shows us that we already reached some significance with a P value less that 0.03, but that can possibly still be improved by considering the interaction of the week with the treatment:

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0768  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.9964  8.0620        
##           week         0.9687  0.9842   -0.51
##  Residual             96.4707  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2521  21.262
## week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmn2
## week        -0.650              
## treatment2  -0.424  0.469       
## wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
##           npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRS_ref1    7 2745.4 2772.6 -1365.7   2731.4                       
## BPRS_ref2    8 2744.3 2775.4 -1364.1   2728.3 3.1712  1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interestingly enough the interaction made the model worse. That means the the correlation of these variables is very week and should not be considered. Therefore, lets produce some plots based on the previous models.

Fitted <- fitted(BPRS_ref1)
BPRS <- BPRS %>% mutate(Fitted)

p1 <- ggplot(BPRS, aes(x = week, y = bprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ treatment, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs)))
graph1 <- p6

p1 <- ggplot(BPRS, aes(x = week, y = Fitted, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ treatment, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(BPRS$bprs), max(BPRS$bprs)))
graph2 <- p6

graph1; graph2

We were able to make the best fits possible! I hope I will be forgiven that I was not able to plot it in single plots, because the almighty stackoverflow was not able to save me this time :( .